[spark] Add scan.maxRecordsPerPartition config to split log table input partitions#3260

Open
Yohahaha wants to merge 6 commits into apache:main from Yohahaha:spark-split-partition

@Yohahaha Yohahaha commented May 7, 2026

Purpose

Linked issue: close #3215

Brief change log

  • Introduce scan.maxRecordsPerPartition config option for Spark log table reads. When set, each Fluss
    bucket whose offset range exceeds this value will be split into multiple Spark input partitions, improving
    read parallelism for large offset ranges.
  • Update BucketOffsetsRetrieverImpl to support fetching real earliest offsets when needed.
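
The splitting described in the first bullet can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class name `OffsetRangeSplitter`, the method `split`, the half-open `[start, end)` range convention, and the treatment of a non-positive limit as "option not set" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Fluss or Spark. */
class OffsetRangeSplitter {

    /**
     * Splits the half-open offset range [start, end) of one bucket into
     * consecutive sub-ranges that each cover at most maxRecords offsets.
     * Assumption: a non-positive maxRecords means the option is unset,
     * so the whole range is returned as a single sub-range.
     */
    static List<long[]> split(long start, long end, long maxRecords) {
        List<long[]> ranges = new ArrayList<>();
        if (end <= start) {
            return ranges; // nothing to read for an empty range
        }
        if (maxRecords <= 0) {
            ranges.add(new long[] {start, end});
            return ranges;
        }
        long s = start;
        while (s < end) {
            // Compare via subtraction so s + maxRecords cannot overflow.
            long e = (end - s > maxRecords) ? s + maxRecords : end;
            ranges.add(new long[] {s, e});
            s = e;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // One bucket with offsets [0, 10) and a limit of 4 records per partition.
        for (long[] r : split(0L, 10L, 4L)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Under these assumptions, a bucket covering offsets `[0, 10)` with a limit of 4 yields three sub-ranges, `[0, 4)`, `[4, 8)`, and `[8, 10)`, each of which would map to its own Spark input partition. Splitting this way needs a concrete start offset, which is presumably why the second bullet adds support for fetching real earliest offsets.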

Tests

SparkLogTableReadTest: "Spark Read: split partition by config"

API and Format

Documentation

@Yohahaha Yohahaha marked this pull request as ready for review May 7, 2026 02:52

Yohahaha commented May 7, 2026

@YannByron

Yohahaha commented May 7, 2026

@luoyuxia @fresh-borzoni PTAL!

@fresh-borzoni fresh-borzoni left a comment

@Yohahaha Ty for the PR, overall LGTM, left minor comments, PTAL

@Yohahaha Yohahaha force-pushed the spark-split-partition branch from 75e67cc to 4bc7805 Compare May 13, 2026 10:53

Yohahaha commented

@luoyuxia @YannByron more comments?

@fresh-borzoni fresh-borzoni left a comment

@Yohahaha Ty for the changes, LGTM 👍, just one minor comment

@Yohahaha Yohahaha changed the title from "[spark] Add scan.max.records.per.partition config to split log table input partitions" to "[spark] Add scan.maxRecordsPerPartition config to split log table input partitions" May 15, 2026

fresh-borzoni commented

@Yohahaha you need to update the config key in the tests to fix the failing tests

Development

Successfully merging this pull request may close these issues.

[spark] Add config to split input partition by input size

3 participants